QSAR studies of the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors by multiple linear regression (MLR) and support vector machine (SVM)
Qin, Z.J.; Wang, M.L.; Yan, A.X.*
Bioorganic & Medicinal Chemistry Letters, 2017, 27, 2931–2938.
This study explored quantitative structure-activity relationship (QSAR) models for predicting the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors. The models were built with two machine learning methods, multiple linear regression (MLR) and support vector machine (SVM), using several descriptor sets and training/test set selection methods. A dataset of 512 HCV NS3/4A protease inhibitors and their IC50 values, all determined by the same FRET assay, was collected from the literature. Each inhibitor was represented by nine selected global descriptors and 12 2D property-weighted autocorrelation descriptors calculated with the program CORINA Symphony. The dataset was divided into a training set and a test set either randomly or with a Kohonen self-organizing map (SOM). The best MLR model gave squared correlation coefficients (r²) of 0.75 on the training set and 0.72 on the test set; the best SVM model gave 0.87 and 0.85. A series of sub-dataset models was also developed, and each of the best sub-dataset models outperformed the corresponding whole-dataset model on its sub-dataset. We believe that combining the best sub-dataset and whole-dataset SVM models can provide reliable lead design tools for new NS3/4A protease inhibitor scaffolds in a drug discovery pipeline.
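The SOM-based training/test split mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the map size, the learning schedule, and the rule "one randomly chosen compound per occupied neuron goes to the test set" are all assumptions made for illustration, and the function names `train_som`, `best_matching_unit`, and `som_split` are hypothetical. Descriptor vectors are assumed to be scaled to a comparable range beforehand.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weights are closest to x."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)))


def train_som(data, rows, cols, epochs=50, seed=0):
    """Fit a small rectangular Kohonen map; weights are keyed by (row, col)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)                              # decaying learning rate
            radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5    # shrinking neighbourhood
            bmu = best_matching_unit(weights, data[i])
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))            # Gaussian neighbourhood
                for k in range(dim):
                    w[k] += lr * h * (data[i][k] - w[k])
            step += 1
    return weights


def som_split(data, rows=3, cols=3, seed=0):
    """Return (train_idx, test_idx).

    Assumed selection rule: from every occupied neuron, one randomly chosen
    compound goes to the test set and the rest go to the training set.
    """
    weights = train_som(data, rows, cols, seed=seed)
    cells = {}
    for i, x in enumerate(data):
        cells.setdefault(best_matching_unit(weights, x), []).append(i)
    rng = random.Random(seed)
    test = sorted(rng.choice(members) for members in cells.values())
    test_set = set(test)
    train = [i for i in range(len(data)) if i not in test_set]
    return train, test
```

Calling `som_split(descriptors, rows, cols)` on a list of descriptor vectors returns index lists for the two sets. Because similar compounds land in the same neuron, drawing the test set across neurons spreads it over the chemical space covered by the training set, which is the usual motivation for preferring this kind of split over a purely random one.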
QSAR model performance: Dataset 1 (512 inhibitors)

Model Name | Algorithm | Descriptors | Splitting method | Training set r² | Training set sd | Training set MAE | Test set r² | Test set sd | Test set MAE |
---|---|---|---|---|---|---|---|---|---|
Model A1 | MLR | 7 CORINA Global | Random | 0.67 | 0.95 | 0.73 | 0.58 | 1.05 | 0.76 |
Model A2 | MLR | 7 CORINA Global | Kohonen’s self-organizing map (SOM) | 0.64 | 0.98 | 0.76 | 0.65 | 0.95 | 0.74 |
Model B1 | MLR | 7 CORINA Global | Random | 0.64 | 0.98 | 0.77 | 0.54 | 1.10 | 0.81 |
Model B2 | MLR | 7 CORINA Global | Kohonen’s self-organizing map (SOM) | 0.60 | 1.03 | 0.81 | 0.63 | 0.98 | 0.75 |
Model C1 | MLR | 2 CORINA Global 8 CORINA 2D | Random | 0.77 | 0.80 | 0.62 | 0.67 | 0.93 | 0.65 |
Model C2 | MLR | 2 CORINA Global 8 CORINA 2D | Kohonen’s self-organizing map (SOM) | 0.75 | 0.82 | 0.62 | 0.72 | 0.87 | 0.66 |
Model D1 | MLR | 2 CORINA Global 9 CORINA 2D | Random | 0.75 | 0.83 | 0.65 | 0.67 | 0.94 | 0.68 |
Model D2 | MLR | 2 CORINA Global 9 CORINA 2D | Kohonen’s self-organizing map (SOM) | 0.73 | 0.86 | 0.66 | 0.72 | 0.87 | 0.65 |
Model A3 | SVM | 7 CORINA Global | Random | 0.80 | 0.73 | 0.53 | 0.72 | 0.84 | 0.60 |
Model A4 | SVM | 7 CORINA Global | Kohonen’s self-organizing map (SOM) | 0.78 | 0.77 | 0.55 | 0.79 | 0.72 | 0.56 |
Model B3 | SVM | 7 CORINA Global | Random | 0.81 | 0.72 | 0.52 | 0.73 | 0.83 | 0.58 |
Model B4 | SVM | 7 CORINA Global | Kohonen’s self-organizing map (SOM) | 0.79 | 0.75 | 0.52 | 0.80 | 0.73 | 0.53 |
Model C3 | SVM | 2 CORINA Global 8 CORINA 2D | Random | 0.90 | 0.54 | 0.42 | 0.75 | 0.82 | 0.55 |
Model C4 | SVM | 2 CORINA Global 8 CORINA 2D | Kohonen’s self-organizing map (SOM) | 0.88 | 0.56 | 0.40 | 0.83 | 0.68 | 0.50 |
Model D3 | SVM | 2 CORINA Global 9 CORINA 2D | Random | 0.90 | 0.53 | 0.41 | 0.81 | 0.70 | 0.49 |
Model D4 | SVM | 2 CORINA Global 9 CORINA 2D | Kohonen’s self-organizing map (SOM) | 0.87 | 0.59 | 0.42 | 0.85 | 0.63 | 0.47 |

QSAR models: Dataset 2 (355 linear inhibitors from Dataset 1)

Model | Splitting method | Algorithm | Descriptors | Training set r² | Training set sd | Training set MAE | Test set r² | Test set sd | Test set MAE |
---|---|---|---|---|---|---|---|---|---|
Model C2 (for predicting 355 linear inhibitors) | Kohonen’s self-organizing map (SOM) | MLR | 2 CORINA Global 8 CORINA 2D | 0.74 | 0.84 | 0.62 | 0.68 | 0.91 | 0.67 |
Model LA1 | Kohonen’s self-organizing map (SOM) | MLR | 2 CORINA Global 8 CORINA 2D | 0.77 | 0.78 | 0.59 | 0.77 | 0.77 | 0.59 |
Model D4 (for predicting 355 linear inhibitors) | Kohonen’s self-organizing map (SOM) | SVM | 2 CORINA Global 8 CORINA 2D | 0.86 | 0.62 | 0.44 | 0.83 | 0.68 | 0.49 |
Model LB2 | Kohonen’s self-organizing map (SOM) | SVM | 2 CORINA Global 8 CORINA 2D | 0.87 | 0.59 | 0.43 | 0.85 | 0.63 | 0.45 |

QSAR models: Dataset 3 (157 macrocyclic inhibitors from Dataset 1)

Model | Splitting method | Algorithm | Descriptors | Training set r² | Training set sd | Training set MAE | Test set r² | Test set sd | Test set MAE |
---|---|---|---|---|---|---|---|---|---|
Model C2 (for predicting 157 macrocyclic inhibitors) | Kohonen’s self-organizing map (SOM) | MLR | 2 CORINA Global 8 CORINA 2D | 0.29 | 0.81 | 0.60 | 0.32 | 0.86 | 0.62 |
Model MC1 | Kohonen’s self-organizing map (SOM) | MLR | 2 CORINA Global 8 CORINA 2D | 0.58 | 0.57 | 0.41 | 0.47 | 0.66 | 0.47 |
Model D4 (for predicting 157 macrocyclic inhibitors) | Kohonen’s self-organizing map (SOM) | SVM | 2 CORINA Global 8 CORINA 2D | 0.60 | 0.56 | 0.39 | 0.55 | 0.62 | 0.41 |
Model MD2 | Kohonen’s self-organizing map (SOM) | SVM | 2 CORINA Global 8 CORINA 2D | 0.76 | 0.45 | 0.28 | 0.67 | 0.50 | 0.35 |
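
The r², sd, and MAE columns in the tables can be computed from experimental and predicted activity values as sketched below. This summary does not define the three statistics, so the definitions here are assumptions: r² is taken as the squared Pearson correlation between experimental and predicted values, sd as the standard deviation of the prediction errors with an n − 1 denominator, and MAE as the mean absolute error. The function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(y_exp, y_pred):
    """Return (r2, sd, mae) for paired experimental and predicted values.

    Assumed definitions (not confirmed by the source):
      r2  = squared Pearson correlation coefficient
      sd  = sqrt(sum of squared errors / (n - 1))
      mae = mean absolute error
    """
    n = len(y_exp)
    mean_e = sum(y_exp) / n
    mean_p = sum(y_pred) / n
    cov = sum((e - mean_e) * (p - mean_p) for e, p in zip(y_exp, y_pred))
    var_e = sum((e - mean_e) ** 2 for e in y_exp)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = cov ** 2 / (var_e * var_p)
    sd = math.sqrt(sum((e - p) ** 2 for e, p in zip(y_exp, y_pred)) / (n - 1))
    mae = sum(abs(e - p) for e, p in zip(y_exp, y_pred)) / n
    return r2, sd, mae
```

Note that under these definitions a model with a constant offset still reaches r² = 1 while its sd and MAE stay above zero, which is one reason the tables are best read with all three columns together.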

Main project members
PhD student
zijianqin@foxmail.com